{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "24gYiJcWNlpA"
      },
      "source": [
        "##### Copyright 2020 Google LLC"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "ioaprt5q5US7"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ItXfxkxvosLH"
      },
      "source": [
        "# Graph-based Neural Structured Learning in TFX\n",
        "\n",
        "This tutorial describes graph regularization from the\n",
        "[Neural Structured Learning](https://www.tensorflow.org/neural_structured_learning/)\n",
        "framework and demonstrates an end-to-end workflow for sentiment classification\n",
        "in a TFX pipeline."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vyAF26z9IDoq"
      },
      "source": [
        "Note: We recommend running this tutorial in a Colab notebook, with no setup required!  Just click \"Run in Google Colab\".\n",
        "\n",
        "<div class=\"devsite-table-wrapper\"><table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "<td><a target=\"_blank\" href=\"https://www.tensorflow.org/tfx/tutorials/tfx/neural_structured_learning\">\n",
        "<img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a></td>\n",
        "<td><a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/tfx/neural_structured_learning.ipynb\">\n",
        "<img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\">Run in Google Colab</a></td>\n",
        "<td><a target=\"_blank\" href=\"https://github.com/tensorflow/tfx/tree/master/docs/tutorials/tfx/neural_structured_learning.ipynb\">\n",
        "<img width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\">View source on GitHub</a></td>\n",
        "</table></div>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z3otbdCMmJiJ"
      },
      "source": [
        "## Overview"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ApxPtg2DiTtd"
      },
      "source": [
        "This notebook classifies movie reviews as *positive* or *negative* using the\n",
        "text of the review. This is an example of *binary* classification, an important\n",
        "and widely applicable kind of machine learning problem.\n",
        "\n",
        "We will demonstrate the use of graph regularization in this notebook by building\n",
        "a graph from the given input. The general recipe for building a\n",
        "graph-regularized model using the Neural Structured Learning (NSL) framework\n",
        "when the input does not contain an explicit graph is as follows:\n",
        "\n",
        "1.  Create embeddings for each text sample in the input. This can be done using\n",
        "    pre-trained models such as [word2vec](https://arxiv.org/pdf/1310.4546.pdf),\n",
        "    [Swivel](https://arxiv.org/abs/1602.02215),\n",
        "    [BERT](https://arxiv.org/abs/1810.04805) etc.\n",
        "2.  Build a graph based on these embeddings by using a similarity metric such as\n",
        "    the 'L2' distance, 'cosine' distance, etc. Nodes in the graph correspond to\n",
        "    samples and edges in the graph correspond to similarity between pairs of\n",
        "    samples.\n",
        "3.  Generate training data from the above synthesized graph and sample features.\n",
        "    The resulting training data will contain neighbor features in addition to\n",
        "    the original node features.\n",
        "4.  Create a neural network as a base model using Estimators.\n",
        "5.  Wrap the base model with the `add_graph_regularization` wrapper function,\n",
        "    which is provided by the NSL framework, to create a new graph Estimator\n",
        "    model. This new model will include a graph regularization loss as the\n",
        "    regularization term in its training objective.\n",
        "6.  Train and evaluate the graph Estimator model.\n",
        "\n",
        "In this tutorial, we integrate the above workflow in a TFX pipeline using\n",
        "several custom TFX components as well as a custom graph-regularized trainer\n",
        "component.\n",
        "\n",
        "Below is the schematic for our TFX pipeline. Orange boxes represent\n",
        "off-the-shelf TFX components and pink boxes represent custom TFX components.\n",
        "\n",
        "![TFX Pipeline](images/nsl/nsl-tfx.svg)"
      ]
    },
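    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make step 5 of the recipe above more concrete, here is a minimal NumPy sketch\n",
        "of a graph-regularized objective: a supervised loss plus a weighted penalty on\n",
        "the distance between a sample's representation and those of its graph neighbors.\n",
        "This is an illustration only, not NSL's implementation. The helper name\n",
        "`graph_regularized_loss`, the squared L2 distance, the averaging over neighbors,\n",
        "and all numeric values are assumptions made for this sketch.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "\n",
        "def graph_regularized_loss(supervised_loss, sample_repr, neighbor_reprs,\n",
        "                           edge_weights, multiplier):\n",
        "  # Hypothetical helper: squared L2 distance from the sample to each neighbor.\n",
        "  distances = np.sum((neighbor_reprs - sample_repr) ** 2, axis=1)\n",
        "  # Weight each distance by its edge weight and average over the neighbors.\n",
        "  neighbor_penalty = np.mean(edge_weights * distances)\n",
        "  # The penalty is scaled by a multiplier and added to the supervised loss.\n",
        "  return supervised_loss + multiplier * neighbor_penalty\n",
        "\n",
        "\n",
        "# Made-up representations for one sample and its two neighbors.\n",
        "sample = np.array([0.2, 0.8])\n",
        "neighbors = np.array([[0.2, 0.8], [0.6, 0.4]])\n",
        "weights = np.array([1.0, 0.5])\n",
        "print(graph_regularized_loss(0.3, sample, neighbors, weights, multiplier=0.1))\n",
        "```\n",
        "\n",
        "A neighbor with an identical representation contributes nothing to the penalty,\n",
        "while a distant neighbor increases the loss in proportion to its edge weight."
      ]
    },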
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EIx0r9-TeVQQ"
      },
      "source": [
        "## Upgrade Pip\n",
        "\n",
        "To avoid upgrading Pip in a system when running locally, check to make sure that we're running in Colab.  Local systems can of course be upgraded separately."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-UmVrHUfkUA2"
      },
      "outputs": [],
      "source": [
        "try:\n",
        "  import colab\n",
        "  !pip install --upgrade pip\n",
        "except:\n",
        "  pass"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nDOFbB34KY1R"
      },
      "source": [
        "## Install Required Packages"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "yDUe7gk_ztZ-"
      },
      "outputs": [],
      "source": [
        "!pip install -q -U \\\n",
        "  tfx==0.23.0 \\\n",
        "  neural-structured-learning \\\n",
        "  tensorflow-hub \\\n",
        "  tensorflow-datasets"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1CeGS8G_eueJ"
      },
      "source": [
        "## Did you restart the runtime?\n",
        "\n",
        "If you are using Google Colab, the first time that you run the cell above, you must restart the runtime (Runtime > Restart runtime ...). This is because of the way that Colab loads packages."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "x6FJ64qMNLez"
      },
      "source": [
        "## Dependencies and imports"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2ew7HTbPpCJH"
      },
      "outputs": [],
      "source": [
        "import apache_beam as beam\n",
        "import gzip as gzip_lib\n",
        "import numpy as np\n",
        "import os\n",
        "import pprint\n",
        "import shutil\n",
        "import tempfile\n",
        "import urllib\n",
        "import uuid\n",
        "pp = pprint.PrettyPrinter()\n",
        "\n",
        "import tensorflow as tf\n",
        "import neural_structured_learning as nsl\n",
        "\n",
        "import tfx\n",
        "from tfx.components.evaluator.component import Evaluator\n",
        "from tfx.components.example_gen.import_example_gen.component import ImportExampleGen\n",
        "from tfx.components.example_validator.component import ExampleValidator\n",
        "from tfx.components.model_validator.component import ModelValidator\n",
        "from tfx.components.pusher.component import Pusher\n",
        "from tfx.components.schema_gen.component import SchemaGen\n",
        "from tfx.components.statistics_gen.component import StatisticsGen\n",
        "from tfx.components.trainer.component import Trainer\n",
        "from tfx.components.transform.component import Transform\n",
        "from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext\n",
        "from tfx.proto import evaluator_pb2\n",
        "from tfx.proto import example_gen_pb2\n",
        "from tfx.proto import pusher_pb2\n",
        "from tfx.proto import trainer_pb2\n",
        "from tfx.utils.dsl_utils import external_input\n",
        "\n",
        "from tfx.types import artifact\n",
        "from tfx.types import artifact_utils\n",
        "from tfx.types import channel\n",
        "from tfx.types import standard_artifacts\n",
        "from tfx.types.standard_artifacts import Examples\n",
        "\n",
        "from tfx.dsl.component.experimental.annotations import InputArtifact\n",
        "from tfx.dsl.component.experimental.annotations import OutputArtifact\n",
        "from tfx.dsl.component.experimental.annotations import Parameter\n",
        "from tfx.dsl.component.experimental.decorators import component\n",
        "\n",
        "from tensorflow_metadata.proto.v0 import anomalies_pb2\n",
        "from tensorflow_metadata.proto.v0 import schema_pb2\n",
        "from tensorflow_metadata.proto.v0 import statistics_pb2\n",
        "\n",
        "import tensorflow_data_validation as tfdv\n",
        "import tensorflow_transform as tft\n",
        "import tensorflow_model_analysis as tfma\n",
        "import tensorflow_hub as hub\n",
        "import tensorflow_datasets as tfds\n",
        "\n",
        "print(\"TF Version: \", tf.__version__)\n",
        "print(\"Eager mode: \", tf.executing_eagerly())\n",
        "print(\n",
        "    \"GPU is\",\n",
        "    \"available\" if tf.config.list_physical_devices(\"GPU\") else \"NOT AVAILABLE\")\n",
        "print(\"NSL Version: \", nsl.__version__)\n",
        "print(\"TFX Version: \", tfx.__version__)\n",
        "print(\"TFDV version: \", tfdv.__version__)\n",
        "print(\"TFT version: \", tft.__version__)\n",
        "print(\"TFMA version: \", tfma.__version__)\n",
        "print(\"Hub version: \", hub.__version__)\n",
        "print(\"Beam version: \", beam.__version__)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nGwwFd99n42P"
      },
      "source": [
        "## IMDB dataset\n",
        "\n",
        "The\n",
        "[IMDB dataset](https://www.tensorflow.org/datasets/catalog/imdb_reviews)\n",
        "contains the text of 50,000 movie reviews from the\n",
        "[Internet Movie Database](https://www.imdb.com/). These are split into 25,000\n",
        "reviews for training and 25,000 reviews for testing. The training and testing\n",
        "sets are *balanced*, meaning they contain an equal number of positive and\n",
        "negative reviews.\n",
        "Moreover, there are 50,000 additional unlabeled movie reviews."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iAsKG535pHep"
      },
      "source": [
        "### Download preprocessed IMDB dataset\n",
        "\n",
        "The following code downloads the IMDB dataset (or uses a cached copy if it has already been downloaded) using TFDS. To speed up this notebook we will use only 10,000 labeled reviews and 10,000 unlabeled reviews for training, and 10,000 test reviews for evaluation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "__cZi2Ic48KL"
      },
      "outputs": [],
      "source": [
        "train_set, eval_set = tfds.load(\n",
        "    \"imdb_reviews:1.0.0\",\n",
        "    split=[\"train[:10000]+unsupervised[:10000]\", \"test[:10000]\"],\n",
        "    shuffle_files=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nE9tNh-67Y3W"
      },
      "source": [
        "Let's look at a few reviews from the training set:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LsnHde8T67Jz"
      },
      "outputs": [],
      "source": [
        "for tfrecord in train_set.take(4):\n",
        "  print(\"Review: {}\".format(tfrecord[\"text\"].numpy().decode(\"utf-8\")[:300]))\n",
        "  print(\"Label: {}\\n\".format(tfrecord[\"label\"].numpy()))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0wG7v3rk-Cwo"
      },
      "outputs": [],
      "source": [
        "def _dict_to_example(instance):\n",
        "  \"\"\"Decoded CSV to tf example.\"\"\"\n",
        "  feature = {}\n",
        "  for key, value in instance.items():\n",
        "    if value is None:\n",
        "      feature[key] = tf.train.Feature()\n",
        "    elif value.dtype == np.integer:\n",
        "      feature[key] = tf.train.Feature(\n",
        "          int64_list=tf.train.Int64List(value=value.tolist()))\n",
        "    elif value.dtype == np.float32:\n",
        "      feature[key] = tf.train.Feature(\n",
        "          float_list=tf.train.FloatList(value=value.tolist()))\n",
        "    else:\n",
        "      feature[key] = tf.train.Feature(\n",
        "          bytes_list=tf.train.BytesList(value=value.tolist()))\n",
        "  return tf.train.Example(features=tf.train.Features(feature=feature))\n",
        "\n",
        "\n",
        "examples_path = tempfile.mkdtemp(prefix=\"tfx-data\")\n",
        "train_path = os.path.join(examples_path, \"train.tfrecord\")\n",
        "eval_path = os.path.join(examples_path, \"eval.tfrecord\")\n",
        "\n",
        "for path, dataset in [(train_path, train_set), (eval_path, eval_set)]:\n",
        "  with tf.io.TFRecordWriter(path) as writer:\n",
        "    for example in dataset:\n",
        "      writer.write(\n",
        "          _dict_to_example({\n",
        "              \"label\": np.array([example[\"label\"].numpy()]),\n",
        "              \"text\": np.array([example[\"text\"].numpy()]),\n",
        "          }).SerializeToString())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HdQWxfsVkzdJ"
      },
      "source": [
        "## Run TFX Components Interactively\n",
        "\n",
        "In the cells that follow you will construct TFX components and run each one interactively within the InteractiveContext to obtain `ExecutionResult` objects.  This mirrors the process of an orchestrator running components in a TFX DAG based on when the dependencies for each component are met."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4aVuXUil7hil"
      },
      "outputs": [],
      "source": [
        "context = InteractiveContext()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "L9fwt9gQk3BR"
      },
      "source": [
        "### The ExampleGen Component\n",
        "In any ML development process the first step when starting code development is to ingest the training and test datasets.  The `ExampleGen` component brings data into the TFX pipeline.\n",
        "\n",
        "Create an ExampleGen component and run it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "WdH4ql3Y7pT4"
      },
      "outputs": [],
      "source": [
        "input_data = external_input(examples_path)\n",
        "\n",
        "input_config = example_gen_pb2.Input(splits=[\n",
        "    example_gen_pb2.Input.Split(name='train', pattern='train.tfrecord'),\n",
        "    example_gen_pb2.Input.Split(name='eval', pattern='eval.tfrecord')\n",
        "])\n",
        "\n",
        "example_gen = ImportExampleGen(input=input_data, input_config=input_config)\n",
        "\n",
        "context.run(example_gen, enable_cache=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IeUp6xCCrxsS"
      },
      "outputs": [],
      "source": [
        "for artifact in example_gen.outputs['examples'].get():\n",
        "  print(artifact)\n",
        "\n",
        "print('\\nexample_gen.outputs is a {}'.format(type(example_gen.outputs)))\n",
        "print(example_gen.outputs)\n",
        "\n",
        "print(example_gen.outputs['examples'].get()[0].split_names)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0SXc2OGnDWz5"
      },
      "source": [
        "The component's outputs include 2 artifacts:\n",
        "* the training examples (10,000 labeled reviews + 10,000 unlabeled reviews)\n",
        "* the eval examples (10,000 labeled reviews)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pcPppPASQzFa"
      },
      "source": [
        "### The IdentifyExamples Custom Component\n",
        "To use NSL, we will need each instance to have a unique ID. We create a custom\n",
        "component that adds such a unique ID to all instances across all splits. We\n",
        "leverage [Apache Beam](https://beam.apache.org) to be able to easily scale to\n",
        "large datasets if needed."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XHCUzXA5qeWe"
      },
      "outputs": [],
      "source": [
        "def make_example_with_unique_id(example, id_feature_name):\n",
        "  \"\"\"Adds a unique ID to the given `tf.train.Example` proto.\n",
        "\n",
        "  This function uses Python's 'uuid' module to generate a universally unique\n",
        "  identifier for each example.\n",
        "\n",
        "  Args:\n",
        "    example: An instance of a `tf.train.Example` proto.\n",
        "    id_feature_name: The name of the feature in the resulting `tf.train.Example`\n",
        "      that will contain the unique identifier.\n",
        "\n",
        "  Returns:\n",
        "    A new `tf.train.Example` proto that includes a unique identifier as an\n",
        "    additional feature.\n",
        "  \"\"\"\n",
        "  result = tf.train.Example()\n",
        "  result.CopyFrom(example)\n",
        "  unique_id = uuid.uuid4()\n",
        "  result.features.feature.get_or_create(\n",
        "      id_feature_name).bytes_list.MergeFrom(\n",
        "          tf.train.BytesList(value=[str(unique_id).encode('utf-8')]))\n",
        "  return result\n",
        "\n",
        "\n",
        "@component\n",
        "def IdentifyExamples(orig_examples: InputArtifact[Examples],\n",
        "                     identified_examples: OutputArtifact[Examples],\n",
        "                     id_feature_name: Parameter[str],\n",
        "                     component_name: Parameter[str]) -> None:\n",
        "\n",
        "  # Get a list of the splits in input_data\n",
        "  splits_list = artifact_utils.decode_split_names(\n",
        "      split_names=orig_examples.split_names)\n",
        "\n",
        "  for split in splits_list:\n",
        "    input_dir = os.path.join(orig_examples.uri, split)\n",
        "    output_dir = os.path.join(identified_examples.uri, split)\n",
        "    os.mkdir(output_dir)\n",
        "    with beam.Pipeline() as pipeline:\n",
        "      (pipeline\n",
        "       | 'ReadExamples' >> beam.io.ReadFromTFRecord(\n",
        "           os.path.join(input_dir, '*'),\n",
        "           coder=beam.coders.coders.ProtoCoder(tf.train.Example))\n",
        "       | 'AddUniqueId' >> beam.Map(make_example_with_unique_id, id_feature_name)\n",
        "       | 'WriteIdentifiedExamples' >> beam.io.WriteToTFRecord(\n",
        "           file_path_prefix=os.path.join(output_dir, 'data_tfrecord'),\n",
        "           coder=beam.coders.coders.ProtoCoder(tf.train.Example),\n",
        "           file_name_suffix='.gz'))\n",
        "\n",
        "  # For completeness, encode the splits names and payload_format.\n",
        "  # We could also just use input_data.split_names.\n",
        "  identified_examples.split_names = artifact_utils.encode_split_names(\n",
        "      splits=splits_list)\n",
        "  # TODO(b/168616829): Remove populating payload_format after tfx 0.25.0.\n",
        "  identified_examples.set_string_custom_property(\n",
        "      \"payload_format\",\n",
        "      orig_examples.get_string_custom_property(\"payload_format\"))\n",
        "\n",
        "  return"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ZtLxNWHPO0je"
      },
      "outputs": [],
      "source": [
        "identify_examples = IdentifyExamples(\n",
        "    orig_examples=example_gen.outputs['examples'],\n",
        "    component_name=u'IdentifyExamples',\n",
        "    id_feature_name=u'id')\n",
        "context.run(identify_examples, enable_cache=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "csM6BFhtk5Aa"
      },
      "source": [
        "### The StatisticsGen Component\n",
        "\n",
        "The `StatisticsGen` component computes descriptive statistics for your dataset.  The statistics that it generates can be visualized for review, and are used for example validation and to infer a schema.\n",
        "\n",
        "Create a StatisticsGen component and run it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "MAscCCYWgA-9"
      },
      "outputs": [],
      "source": [
        "# Computes statistics over data for visualization and example validation.\n",
        "statistics_gen = StatisticsGen(\n",
        "    examples=identify_examples.outputs[\"identified_examples\"])\n",
        "context.run(statistics_gen, enable_cache=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HLKLTO9Nk60p"
      },
      "source": [
        "### The SchemaGen Component\n",
        "\n",
        "The `SchemaGen` component generates a schema for your data based on the statistics from StatisticsGen.  It tries to infer the data types of each of your features, and the ranges of legal values for categorical features.\n",
        "\n",
        "Create a SchemaGen component and run it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ygQvZ6hsiQ_J"
      },
      "outputs": [],
      "source": [
        "# Generates schema based on statistics files.\n",
        "schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])\n",
        "context.run(schema_gen, enable_cache=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kdtU3u01FR-2"
      },
      "source": [
        "The generated artifact is just a `schema.pbtxt` containing a text representation of a `schema_pb2.Schema` protobuf:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "L6-tgKi6A_gK"
      },
      "outputs": [],
      "source": [
        "train_uri = schema_gen.outputs['schema'].get()[0].uri\n",
        "schema_filename = os.path.join(train_uri, 'schema.pbtxt')\n",
        "schema = tfx.utils.io_utils.parse_pbtxt_file(\n",
        "    file_name=schema_filename, message=schema_pb2.Schema())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FaSgx5qIFelw"
      },
      "source": [
        "It can be visualized using `tfdv.display_schema()` (we will look at this in more detail in a subsequent lab):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "gycOsJIQFhi3"
      },
      "outputs": [],
      "source": [
        "tfdv.display_schema(schema)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "V1qcUuO9k9f8"
      },
      "source": [
        "### The ExampleValidator Component\n",
        "\n",
        "The `ExampleValidator` performs anomaly detection, based on the statistics from StatisticsGen and the schema from SchemaGen.  It looks for problems such as missing values, values of the wrong type, or categorical values outside of the domain of acceptable values.\n",
        "\n",
        "Create an ExampleValidator component and run it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XRlRUuGgiXks"
      },
      "outputs": [],
      "source": [
        "# Performs anomaly detection based on statistics and data schema.\n",
        "validate_stats = ExampleValidator(\n",
        "    statistics=statistics_gen.outputs['statistics'],\n",
        "    schema=schema_gen.outputs['schema'])\n",
        "context.run(validate_stats, enable_cache=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g3f2vmrF_e9b"
      },
      "source": [
        "### The SynthesizeGraph Component"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3oCuXo4BPfGr"
      },
      "source": [
        "Graph construction involves creating embeddings for text samples and then using\n",
        "a similarity function to compare the embeddings."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Gf8B3KxcinZ0"
      },
      "source": [
        "We will use pretrained Swivel embeddings to create embeddings in the\n",
        "`tf.train.Example` format for each sample in the input. We will store the\n",
        "resulting embeddings in the `TFRecord` format along with the sample's ID.\n",
        "This is important and will allow us match sample embeddings with corresponding\n",
        "nodes in the graph later."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_hSzZNdbPa4X"
      },
      "source": [
        "Once we have the sample embeddings, we will use them to build a similarity\n",
        "graph, i.e, nodes in this graph will correspond to samples and edges in this\n",
        "graph will correspond to similarity between pairs of nodes.\n",
        "\n",
        "Neural Structured Learning provides a graph building library to build a graph\n",
        "based on sample embeddings. It uses **cosine similarity** as the similarity\n",
        "measure to compare embeddings and build edges between them. It also allows us to specify a similarity threshold, which can be used to discard dissimilar edges from the final graph. In the following example, using 0.99 as the similarity threshold, we end up with a graph that has 115,368 bi-directional edges."
      ]
    },
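    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough illustration of the thresholding idea described above (a minimal\n",
        "NumPy sketch, not the NSL graph builder itself), the snippet below computes\n",
        "pairwise cosine similarities for a few made-up embeddings and keeps only the\n",
        "pairs whose similarity is at or above a threshold. The embedding values, the\n",
        "sample IDs, and the helper name `build_edges` are all hypothetical.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "\n",
        "def build_edges(ids, embeddings, threshold):\n",
        "  # Hypothetical helper: returns (id_a, id_b, similarity) tuples for every\n",
        "  # pair of samples whose cosine similarity is >= threshold.\n",
        "  # Normalize rows so that a dot product equals the cosine similarity.\n",
        "  unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)\n",
        "  similarity = unit @ unit.T\n",
        "  edges = []\n",
        "  for i in range(len(ids)):\n",
        "    for j in range(i + 1, len(ids)):\n",
        "      if similarity[i, j] >= threshold:\n",
        "        edges.append((ids[i], ids[j], float(similarity[i, j])))\n",
        "  return edges\n",
        "\n",
        "\n",
        "# Made-up 2-D embeddings for three samples.\n",
        "ids = ['a', 'b', 'c']\n",
        "embeddings = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])\n",
        "print(build_edges(ids, embeddings, threshold=0.99))\n",
        "```\n",
        "\n",
        "With these values only the pair `('a', 'b')` survives the threshold, because\n",
        "sample `c` points in a nearly orthogonal direction to the other two."
      ]
    },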
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nERXNfSWPa4Z"
      },
      "source": [
        "**Note:** Graph quality and by extension, embedding quality, are very important\n",
        "for graph regularization. While we use Swivel embeddings in this notebook, using BERT embeddings for instance, will likely capture review semantics more\n",
        "accurately. We encourage users to use embeddings of their choice and as appropriate to their needs."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2bAttbhgPa4V"
      },
      "outputs": [],
      "source": [
        "swivel_url = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1'\n",
        "hub_layer = hub.KerasLayer(swivel_url, input_shape=[], dtype=tf.string)\n",
        "\n",
        "\n",
        "def _bytes_feature(value):\n",
        "  \"\"\"Returns a bytes_list from a string / byte.\"\"\"\n",
        "  return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))\n",
        "\n",
        "\n",
        "def _float_feature(value):\n",
        "  \"\"\"Returns a float_list from a float / double.\"\"\"\n",
        "  return tf.train.Feature(float_list=tf.train.FloatList(value=value))\n",
        "\n",
        "\n",
        "def create_embedding_example(example):\n",
        "  \"\"\"Create tf.Example containing the sample's embedding and its ID.\"\"\"\n",
        "  sentence_embedding = hub_layer(tf.sparse.to_dense(example['text']))\n",
        "\n",
        "  # Flatten the sentence embedding back to 1-D.\n",
        "  sentence_embedding = tf.reshape(sentence_embedding, shape=[-1])\n",
        "\n",
        "  feature_dict = {\n",
        "      'id': _bytes_feature(tf.sparse.to_dense(example['id']).numpy()),\n",
        "      'embedding': _float_feature(sentence_embedding.numpy().tolist())\n",
        "  }\n",
        "\n",
        "  return tf.train.Example(features=tf.train.Features(feature=feature_dict))\n",
        "\n",
        "\n",
        "def create_dataset(uri):\n",
        "  tfrecord_filenames = [os.path.join(uri, name) for name in os.listdir(uri)]\n",
        "  return tf.data.TFRecordDataset(tfrecord_filenames, compression_type='GZIP')\n",
        "\n",
        "\n",
        "def create_embeddings(train_path, output_path):\n",
        "  dataset = create_dataset(train_path)\n",
        "  embeddings_path = os.path.join(output_path, 'embeddings.tfr')\n",
        "\n",
        "  feature_map = {\n",
        "      'label': tf.io.FixedLenFeature([], tf.int64),\n",
        "      'id': tf.io.VarLenFeature(tf.string),\n",
        "      'text': tf.io.VarLenFeature(tf.string)\n",
        "  }\n",
        "\n",
        "  with tf.io.TFRecordWriter(embeddings_path) as writer:\n",
        "    for tfrecord in dataset:\n",
        "      tensor_dict = tf.io.parse_single_example(tfrecord, feature_map)\n",
        "      embedding_example = create_embedding_example(tensor_dict)\n",
        "      writer.write(embedding_example.SerializeToString())\n",
        "\n",
        "\n",
        "def build_graph(output_path, similarity_threshold):\n",
        "  embeddings_path = os.path.join(output_path, 'embeddings.tfr')\n",
        "  graph_path = os.path.join(output_path, 'graph.tfv')\n",
        "  nsl.tools.build_graph([embeddings_path], graph_path, similarity_threshold)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ITkf2SLg1TG7"
      },
      "outputs": [],
      "source": [
        "\"\"\"Custom Artifact type\"\"\"\n",
        "\n",
        "\n",
        "class SynthesizedGraph(tfx.types.artifact.Artifact):\n",
        "  \"\"\"Output artifact of the SynthesizeGraph component\"\"\"\n",
        "  TYPE_NAME = 'SynthesizedGraphPath'\n",
        "  PROPERTIES = {\n",
        "      'span': standard_artifacts.SPAN_PROPERTY,\n",
        "      'split_names': standard_artifacts.SPLIT_NAMES_PROPERTY,\n",
        "  }\n",
        "\n",
        "\n",
        "@component\n",
        "def SynthesizeGraph(identified_examples: InputArtifact[Examples],\n",
        "                    synthesized_graph: OutputArtifact[SynthesizedGraph],\n",
        "                    similarity_threshold: Parameter[float],\n",
        "                    component_name: Parameter[str]) -> None:\n",
        "\n",
        "  # Get a list of the splits in input_data\n",
        "  splits_list = artifact_utils.decode_split_names(\n",
        "      split_names=identified_examples.split_names)\n",
        "\n",
        "  # We build a graph only based on the 'train' split which includes both\n",
        "  # labeled and unlabeled examples.\n",
        "  train_input_examples_uri = os.path.join(identified_examples.uri, 'train')\n",
        "  output_graph_uri = os.path.join(synthesized_graph.uri, 'train')\n",
        "  os.mkdir(output_graph_uri)\n",
        "\n",
        "  print('Creating embeddings...')\n",
        "  create_embeddings(train_input_examples_uri, output_graph_uri)\n",
        "\n",
        "  print('Synthesizing graph...')\n",
        "  build_graph(output_graph_uri, similarity_threshold)\n",
        "\n",
        "  synthesized_graph.split_names = artifact_utils.encode_split_names(\n",
        "      splits=['train'])\n",
        "\n",
        "  return"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "H0ZkHvJMA-0G"
      },
      "outputs": [],
      "source": [
        "synthesize_graph = SynthesizeGraph(\n",
        "    identified_examples=identify_examples.outputs['identified_examples'],\n",
        "    component_name=u'SynthesizeGraph',\n",
        "    similarity_threshold=0.99)\n",
        "context.run(synthesize_graph, enable_cache=False)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "o54M-0Q11FcS"
      },
      "outputs": [],
      "source": [
        "train_uri = synthesize_graph.outputs[\"synthesized_graph\"].get()[0].uri\n",
        "os.listdir(train_uri)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IRK_rS_q1UcZ"
      },
      "outputs": [],
      "source": [
        "graph_path = os.path.join(train_uri, \"train\", \"graph.tfv\")\n",
        "print(\"node 1\\t\\t\\t\\t\\tnode 2\\t\\t\\t\\t\\tsimilarity\")\n",
        "!head {graph_path}\n",
        "print(\"...\")\n",
        "!tail {graph_path}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "uybqyWztvCGm"
      },
      "outputs": [],
      "source": [
        "!wc -l {graph_path}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JPViEz5RlA36"
      },
      "source": [
        "### The Transform Component\n",
        "\n",
        "The `Transform` component performs data transformations and feature engineering.  The results include an input TensorFlow graph which is used during both training and serving to preprocess the data before training or inference.  This graph becomes part of the SavedModel that is the result of model training.  Since the same input graph is used for both training and serving, the preprocessing will always be the same, and only needs to be written once.\n",
        "\n",
        "The Transform component requires more code than many other components because of the arbitrary complexity of the feature engineering that you may need for the data and/or model that you're working with.  It requires code files to be available which define the processing needed."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_USkfut69gNW"
      },
      "source": [
        "Each sample will include the following three features:\n",
        "\n",
        "1.  **id**: The node ID of the sample.\n",
        "2.  **text_xf**: An int64 list containing word IDs.\n",
        "3.  **label_xf**: A singleton int64 identifying the target class of the review: 0=negative, 1=positive."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XUYeCayFG7kH"
      },
      "source": [
        "Let's define a module containing the `preprocessing_fn()` function that we will pass to the `Transform` component:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "7uuWiQbOG9ki"
      },
      "outputs": [],
      "source": [
        "_transform_module_file = 'imdb_transform.py'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "v3EIuVQnBfH7"
      },
      "outputs": [],
      "source": [
        "%%writefile {_transform_module_file}\n",
        "\n",
        "import tensorflow as tf\n",
        "\n",
        "import tensorflow_transform as tft\n",
        "\n",
        "SEQUENCE_LENGTH = 100\n",
        "VOCAB_SIZE = 10000\n",
        "OOV_SIZE = 100\n",
        "\n",
        "def tokenize_reviews(reviews, sequence_length=SEQUENCE_LENGTH):\n",
        "  reviews = tf.strings.lower(reviews)\n",
        "  reviews = tf.strings.regex_replace(reviews, r\" '| '|^'|'$\", \" \")\n",
        "  reviews = tf.strings.regex_replace(reviews, \"[^a-z' ]\", \" \")\n",
        "  tokens = tf.strings.split(reviews)[:, :sequence_length]\n",
        "  start_tokens = tf.fill([tf.shape(reviews)[0], 1], \"<START>\")\n",
        "  end_tokens = tf.fill([tf.shape(reviews)[0], 1], \"<END>\")\n",
        "  tokens = tf.concat([start_tokens, tokens, end_tokens], axis=1)\n",
        "  tokens = tokens[:, :sequence_length]\n",
        "  tokens = tokens.to_tensor(default_value=\"<PAD>\")\n",
        "  pad = sequence_length - tf.shape(tokens)[1]\n",
        "  tokens = tf.pad(tokens, [[0, 0], [0, pad]], constant_values=\"<PAD>\")\n",
        "  return tf.reshape(tokens, [-1, sequence_length])\n",
        "\n",
        "def preprocessing_fn(inputs):\n",
        "  \"\"\"tf.transform's callback function for preprocessing inputs.\n",
        "\n",
        "  Args:\n",
        "    inputs: map from feature keys to raw not-yet-transformed features.\n",
        "\n",
        "  Returns:\n",
        "    Map from string feature key to transformed feature operations.\n",
        "  \"\"\"\n",
        "  outputs = {}\n",
        "  outputs[\"id\"] = inputs[\"id\"]\n",
        "  tokens = tokenize_reviews(_fill_in_missing(inputs[\"text\"], ''))\n",
        "  outputs[\"text_xf\"] = tft.compute_and_apply_vocabulary(\n",
        "      tokens,\n",
        "      top_k=VOCAB_SIZE,\n",
        "      num_oov_buckets=OOV_SIZE)\n",
        "  outputs[\"label_xf\"] = _fill_in_missing(inputs[\"label\"], -1)\n",
        "  return outputs\n",
        "\n",
        "def _fill_in_missing(x, default_value):\n",
        "  \"\"\"Replace missing values in a SparseTensor.\n",
        "\n",
        "  Fills in missing values of `x` with the default_value.\n",
        "\n",
        "  Args:\n",
        "    x: A `SparseTensor` of rank 2.  Its dense shape should have size at most 1\n",
        "      in the second dimension.\n",
        "    default_value: the value with which to replace the missing values.\n",
        "\n",
        "  Returns:\n",
        "    A rank 1 tensor where missing values of `x` have been filled in.\n",
        "  \"\"\"\n",
        "  return tf.squeeze(\n",
        "      tf.sparse.to_dense(\n",
        "          tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),\n",
        "          default_value),\n",
        "      axis=1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eeMVMafpHHX1"
      },
      "source": [
        "Create and run the `Transform` component, referring to the files that were created above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jHfhth_GiZI9"
      },
      "outputs": [],
      "source": [
        "# Performs transformations and feature engineering in training and serving.\n",
        "transform = Transform(\n",
        "    examples=identify_examples.outputs['identified_examples'],\n",
        "    schema=schema_gen.outputs['schema'],\n",
        "    # TODO(b/169218106): Remove transformed_examples kwargs after bugfix is released.\n",
        "    transformed_examples=channel.Channel(\n",
        "        type=standard_artifacts.Examples,\n",
        "        artifacts=[standard_artifacts.Examples()]),\n",
        "    module_file=_transform_module_file)\n",
        "context.run(transform)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_jbZO1ykHOeG"
      },
      "source": [
        "The `Transform` component has 2 types of outputs:\n",
        "* `transform_graph` is the graph that can perform the preprocessing operations (this graph will be included in the serving and evaluation models).\n",
        "* `transformed_examples` represents the preprocessed training and evaluation data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "j4UjersvAC7p"
      },
      "outputs": [],
      "source": [
        "transform.outputs"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wRFMlRcdHlQy"
      },
      "source": [
        "Take a peek at the `transform_graph` artifact: it points to a directory containing 3 subdirectories:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "E4I-cqfQQvaW"
      },
      "outputs": [],
      "source": [
        "train_uri = transform.outputs['transform_graph'].get()[0].uri\n",
        "os.listdir(train_uri)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9374B4RpHzor"
      },
      "source": [
        "The `transform_fn` subdirectory contains the actual preprocessing graph. The `metadata` subdirectory contains the schema of the original data. The `transformed_metadata` subdirectory contains the schema of the preprocessed data.\n",
        "\n",
        "Take a look at some of the transformed examples and check that they are indeed processed as intended."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-QPONyzDTswf"
      },
      "outputs": [],
      "source": [
        "def pprint_examples(artifact, n_examples=3):\n",
        "  print(\"artifact:\", artifact)\n",
        "  uri = os.path.join(artifact.uri, \"train\")\n",
        "  print(\"uri:\", uri)\n",
        "  tfrecord_filenames = [os.path.join(uri, name) for name in os.listdir(uri)]\n",
        "  print(\"tfrecord_filenames:\", tfrecord_filenames)\n",
        "  dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type=\"GZIP\")\n",
        "  for tfrecord in dataset.take(n_examples):\n",
        "    serialized_example = tfrecord.numpy()\n",
        "    example = tf.train.Example.FromString(serialized_example)\n",
        "    pp.pprint(example)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2zIepQhSQoPa"
      },
      "outputs": [],
      "source": [
        "pprint_examples(transform.outputs['transformed_examples'].get()[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vpGvPKielIvI"
      },
      "source": [
        "### The GraphAugmentation Component\n",
        "\n",
        "Since we have the sample features and the synthesized graph, we can generate the\n",
        "augmented training data for Neural Structured Learning. The NSL framework\n",
        "provides a library to combine the graph and the sample features to produce\n",
        "the final training data for graph regularization. The resulting training data\n",
        "will include original sample features as well as features of their corresponding\n",
        "neighbors.\n",
        "\n",
        "In this tutorial, we consider undirected edges and use a maximum of 3 neighbors\n",
        "per sample to augment training data with graph neighbors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "gI6P_-AXGm04"
      },
      "outputs": [],
      "source": [
        "def split_train_and_unsup(input_uri):\n",
        "  'Separate the labeled and unlabeled instances.'\n",
        "\n",
        "  tmp_dir = tempfile.mkdtemp(prefix='tfx-data')\n",
        "  tfrecord_filenames = [\n",
        "      os.path.join(input_uri, filename) for filename in os.listdir(input_uri)\n",
        "  ]\n",
        "  train_path = os.path.join(tmp_dir, 'train.tfrecord')\n",
        "  unsup_path = os.path.join(tmp_dir, 'unsup.tfrecord')\n",
        "  with tf.io.TFRecordWriter(train_path) as train_writer, \\\n",
        "       tf.io.TFRecordWriter(unsup_path) as unsup_writer:\n",
        "    for tfrecord in tf.data.TFRecordDataset(\n",
        "        tfrecord_filenames, compression_type='GZIP'):\n",
        "      example = tf.train.Example()\n",
        "      example.ParseFromString(tfrecord.numpy())\n",
        "      if ('label_xf' not in example.features.feature or\n",
        "          example.features.feature['label_xf'].int64_list.value[0] == -1):\n",
        "        writer = unsup_writer\n",
        "      else:\n",
        "        writer = train_writer\n",
        "      writer.write(tfrecord.numpy())\n",
        "  return train_path, unsup_path\n",
        "\n",
        "\n",
        "def gzip(filepath):\n",
        "  with open(filepath, 'rb') as f_in:\n",
        "    with gzip_lib.open(filepath + '.gz', 'wb') as f_out:\n",
        "      shutil.copyfileobj(f_in, f_out)\n",
        "  os.remove(filepath)\n",
        "\n",
        "\n",
        "def copy_tfrecords(input_uri, output_uri):\n",
        "  for filename in os.listdir(input_uri):\n",
        "    input_filename = os.path.join(input_uri, filename)\n",
        "    output_filename = os.path.join(output_uri, filename)\n",
        "    shutil.copyfile(input_filename, output_filename)\n",
        "\n",
        "\n",
        "@component\n",
        "def GraphAugmentation(identified_examples: InputArtifact[Examples],\n",
        "                      synthesized_graph: InputArtifact[SynthesizedGraph],\n",
        "                      augmented_examples: OutputArtifact[Examples],\n",
        "                      num_neighbors: Parameter[int],\n",
        "                      component_name: Parameter[str]) -> None:\n",
        "\n",
        "  # Get a list of the splits in input_data\n",
        "  splits_list = artifact_utils.decode_split_names(\n",
        "      split_names=identified_examples.split_names)\n",
        "\n",
        "  train_input_uri = os.path.join(identified_examples.uri, 'train')\n",
        "  eval_input_uri = os.path.join(identified_examples.uri, 'eval')\n",
        "  train_graph_uri = os.path.join(synthesized_graph.uri, 'train')\n",
        "  train_output_uri = os.path.join(augmented_examples.uri, 'train')\n",
        "  eval_output_uri = os.path.join(augmented_examples.uri, 'eval')\n",
        "\n",
        "  os.mkdir(train_output_uri)\n",
        "  os.mkdir(eval_output_uri)\n",
        "\n",
        "  # Separate out the labeled and unlabeled examples from the 'train' split.\n",
        "  train_path, unsup_path = split_train_and_unsup(train_input_uri)\n",
        "\n",
        "  output_path = os.path.join(train_output_uri, 'nsl_train_data.tfr')\n",
        "  pack_nbrs_args = dict(\n",
        "      labeled_examples_path=train_path,\n",
        "      unlabeled_examples_path=unsup_path,\n",
        "      graph_path=os.path.join(train_graph_uri, 'graph.tfv'),\n",
        "      output_training_data_path=output_path,\n",
        "      add_undirected_edges=True,\n",
        "      max_nbrs=num_neighbors)\n",
        "  print('nsl.tools.pack_nbrs arguments:', pack_nbrs_args)\n",
        "  nsl.tools.pack_nbrs(**pack_nbrs_args)\n",
        "\n",
        "  # Downstream components expect gzip'ed TFRecords.\n",
        "  gzip(output_path)\n",
        "\n",
        "  # The test examples are left untouched and are simply copied over.\n",
        "  copy_tfrecords(eval_input_uri, eval_output_uri)\n",
        "\n",
        "  augmented_examples.split_names = identified_examples.split_names\n",
        "\n",
        "  return"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "r9MIEVDiOANe"
      },
      "outputs": [],
      "source": [
        "# Augments training data with graph neighbors.\n",
        "graph_augmentation = GraphAugmentation(\n",
        "    identified_examples=transform.outputs['transformed_examples'],\n",
        "    synthesized_graph=synthesize_graph.outputs['synthesized_graph'],\n",
        "    component_name=u'GraphAugmentation',\n",
        "    num_neighbors=3)\n",
        "context.run(graph_augmentation, enable_cache=False)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "gpSLs3Hx8viI"
      },
      "outputs": [],
      "source": [
        "pprint_examples(graph_augmentation.outputs['augmented_examples'].get()[0], 6)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OBJFtnl6lCg9"
      },
      "source": [
        "### The Trainer Component\n",
        "\n",
        "The `Trainer` component trains models using TensorFlow.\n",
        "\n",
        "Create a Python module containing a `trainer_fn` function, which must return an estimator.  If you prefer creating a Keras model, you can do so and then convert it to an estimator using `keras.model_to_estimator()`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5ajvClE6b2pd"
      },
      "outputs": [],
      "source": [
        "# Setup paths.\n",
        "_trainer_module_file = 'imdb_trainer.py'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_dh6AejVk2Oq"
      },
      "outputs": [],
      "source": [
        "%%writefile {_trainer_module_file}\n",
        "\n",
        "import neural_structured_learning as nsl\n",
        "\n",
        "import tensorflow as tf\n",
        "\n",
        "import tensorflow_model_analysis as tfma\n",
        "import tensorflow_transform as tft\n",
        "from tensorflow_transform.tf_metadata import schema_utils\n",
        "\n",
        "\n",
        "NBR_FEATURE_PREFIX = 'NL_nbr_'\n",
        "NBR_WEIGHT_SUFFIX = '_weight'\n",
        "LABEL_KEY = 'label'\n",
        "ID_FEATURE_KEY = 'id'\n",
        "\n",
        "def _transformed_name(key):\n",
        "  return key + '_xf'\n",
        "\n",
        "\n",
        "def _transformed_names(keys):\n",
        "  return [_transformed_name(key) for key in keys]\n",
        "\n",
        "\n",
        "# Hyperparameters:\n",
        "#\n",
        "# We will use an instance of `HParams` to inclue various hyperparameters and\n",
        "# constants used for training and evaluation. We briefly describe each of them\n",
        "# below:\n",
        "#\n",
        "# -   max_seq_length: This is the maximum number of words considered from each\n",
        "#                     movie review in this example.\n",
        "# -   vocab_size: This is the size of the vocabulary considered for this\n",
        "#                 example.\n",
        "# -   oov_size: This is the out-of-vocabulary size considered for this example.\n",
        "# -   distance_type: This is the distance metric used to regularize the sample\n",
        "#                    with its neighbors.\n",
        "# -   graph_regularization_multiplier: This controls the relative weight of the\n",
        "#                                      graph regularization term in the overall\n",
        "#                                      loss function.\n",
        "# -   num_neighbors: The number of neighbors used for graph regularization. This\n",
        "#                    value has to be less than or equal to the `num_neighbors`\n",
        "#                    argument used above in the GraphAugmentation component when\n",
        "#                    invoking `nsl.tools.pack_nbrs`.\n",
        "# -   num_fc_units: The number of units in the fully connected layer of the\n",
        "#                   neural network.\n",
        "class HParams(object):\n",
        "  \"\"\"Hyperparameters used for training.\"\"\"\n",
        "  def __init__(self):\n",
        "    ### dataset parameters\n",
        "    # The following 3 values should match those defined in the Transform\n",
        "    # Component.\n",
        "    self.max_seq_length = 100\n",
        "    self.vocab_size = 10000\n",
        "    self.oov_size = 100\n",
        "    ### Neural Graph Learning parameters\n",
        "    self.distance_type = nsl.configs.DistanceType.L2\n",
        "    self.graph_regularization_multiplier = 0.1\n",
        "    # The following value has to be at most the value of 'num_neighbors' used\n",
        "    # in the GraphAugmentation component.\n",
        "    self.num_neighbors = 1\n",
        "    ### Model Architecture\n",
        "    self.num_embedding_dims = 16\n",
        "    self.num_fc_units = 64\n",
        "\n",
        "HPARAMS = HParams()\n",
        "\n",
        "\n",
        "def optimizer_fn():\n",
        "  \"\"\"Returns an instance of `tf.Optimizer`.\"\"\"\n",
        "  return tf.compat.v1.train.RMSPropOptimizer(\n",
        "    learning_rate=0.0001, decay=1e-6)\n",
        "\n",
        "\n",
        "def build_train_op(loss, global_step):\n",
        "  \"\"\"Builds a train op to optimize the given loss using gradient descent.\"\"\"\n",
        "  with tf.name_scope('train'):\n",
        "    optimizer = optimizer_fn()\n",
        "    train_op = optimizer.minimize(loss=loss, global_step=global_step)\n",
        "  return train_op\n",
        "\n",
        "\n",
        "# Building the model:\n",
        "#\n",
        "# A neural network is created by stacking layers—this requires two main\n",
        "# architectural decisions:\n",
        "# * How many layers to use in the model?\n",
        "# * How many *hidden units* to use for each layer?\n",
        "#\n",
        "# In this example, the input data consists of an array of word-indices. The\n",
        "# labels to predict are either 0 or 1. We will use a feed-forward neural network\n",
        "# as our base model in this tutorial.\n",
        "def feed_forward_model(features, is_training, reuse=tf.compat.v1.AUTO_REUSE):\n",
        "  \"\"\"Builds a simple 2 layer feed forward neural network.\n",
        "\n",
        "  The layers are effectively stacked sequentially to build the classifier. The\n",
        "  first layer is an Embedding layer, which takes the integer-encoded vocabulary\n",
        "  and looks up the embedding vector for each word-index. These vectors are\n",
        "  learned as the model trains. The vectors add a dimension to the output array.\n",
        "  The resulting dimensions are: (batch, sequence, embedding). Next is a global\n",
        "  average pooling 1D layer, which reduces the dimensionality of its inputs from\n",
        "  3D to 2D. This fixed-length output vector is piped through a fully-connected\n",
        "  (Dense) layer with 16 hidden units. The last layer is densely connected with a\n",
        "  single output node. Using the sigmoid activation function, this value is a\n",
        "  float between 0 and 1, representing a probability, or confidence level.\n",
        "\n",
        "  Args:\n",
        "    features: A dictionary containing batch features returned from the\n",
        "      `input_fn`, that include sample features, corresponding neighbor features,\n",
        "      and neighbor weights.\n",
        "    is_training: a Python Boolean value or a Boolean scalar Tensor, indicating\n",
        "      whether to apply dropout.\n",
        "    reuse: a Python Boolean value for reusing variable scope.\n",
        "\n",
        "  Returns:\n",
        "    logits: Tensor of shape [batch_size, 1].\n",
        "    representations: Tensor of shape [batch_size, _] for graph regularization.\n",
        "      This is the representation of each example at the graph regularization\n",
        "      layer.\n",
        "  \"\"\"\n",
        "\n",
        "  with tf.compat.v1.variable_scope('ff', reuse=reuse):\n",
        "    inputs = features[_transformed_name('text')]\n",
        "    embeddings = tf.compat.v1.get_variable(\n",
        "        'embeddings',\n",
        "        shape=[\n",
        "            HPARAMS.vocab_size + HPARAMS.oov_size, HPARAMS.num_embedding_dims\n",
        "        ])\n",
        "    embedding_layer = tf.nn.embedding_lookup(embeddings, inputs)\n",
        "\n",
        "    pooling_layer = tf.compat.v1.layers.AveragePooling1D(\n",
        "        pool_size=HPARAMS.max_seq_length, strides=HPARAMS.max_seq_length)(\n",
        "            embedding_layer)\n",
        "    # Shape of pooling_layer is now [batch_size, 1, HPARAMS.num_embedding_dims]\n",
        "    pooling_layer = tf.reshape(pooling_layer, [-1, HPARAMS.num_embedding_dims])\n",
        "\n",
        "    dense_layer = tf.compat.v1.layers.Dense(\n",
        "        16, activation='relu')(\n",
        "            pooling_layer)\n",
        "\n",
        "    output_layer = tf.compat.v1.layers.Dense(\n",
        "        1, activation='sigmoid')(\n",
        "            dense_layer)\n",
        "\n",
        "    # Graph regularization will be done on the penultimate (dense) layer\n",
        "    # because the output layer is a single floating point number.\n",
        "    return output_layer, dense_layer\n",
        "\n",
        "\n",
        "# A note on hidden units:\n",
        "#\n",
        "# The above model has two intermediate or \"hidden\" layers, between the input and\n",
        "# output, and excluding the Embedding layer. The number of outputs (units,\n",
        "# nodes, or neurons) is the dimension of the representational space for the\n",
        "# layer. In other words, the amount of freedom the network is allowed when\n",
        "# learning an internal representation. If a model has more hidden units\n",
        "# (a higher-dimensional representation space), and/or more layers, then the\n",
        "# network can learn more complex representations. However, it makes the network\n",
        "# more computationally expensive and may lead to learning unwanted\n",
        "# patterns—patterns that improve performance on training data but not on the\n",
        "# test data. This is called overfitting.\n",
        "\n",
        "\n",
        "# This function will be used to generate the embeddings for samples and their\n",
        "# corresponding neighbors, which will then be used for graph regularization.\n",
        "def embedding_fn(features, mode):\n",
        "  \"\"\"Returns the embedding corresponding to the given features.\n",
        "\n",
        "  Args:\n",
        "    features: A dictionary containing batch features returned from the\n",
        "      `input_fn`, that include sample features, corresponding neighbor features,\n",
        "      and neighbor weights.\n",
        "    mode: Specifies if this is training, evaluation, or prediction. See\n",
        "      tf.estimator.ModeKeys.\n",
        "\n",
        "  Returns:\n",
        "    The embedding that will be used for graph regularization.\n",
        "  \"\"\"\n",
        "  is_training = (mode == tf.estimator.ModeKeys.TRAIN)\n",
        "  _, embedding = feed_forward_model(features, is_training)\n",
        "  return embedding\n",
        "\n",
        "\n",
        "def feed_forward_model_fn(features, labels, mode, params, config):\n",
        "  \"\"\"Implementation of the model_fn for the base feed-forward model.\n",
        "\n",
        "  Args:\n",
        "    features: This is the first item returned from the `input_fn` passed to\n",
        "      `train`, `evaluate`, and `predict`. This should be a single `Tensor` or\n",
        "      `dict` of same.\n",
        "    labels: This is the second item returned from the `input_fn` passed to\n",
        "      `train`, `evaluate`, and `predict`. This should be a single `Tensor` or\n",
        "      `dict` of same (for multi-head models). If mode is `ModeKeys.PREDICT`,\n",
        "      `labels=None` will be passed. If the `model_fn`'s signature does not\n",
        "      accept `mode`, the `model_fn` must still be able to handle `labels=None`.\n",
        "    mode: Optional. Specifies if this training, evaluation or prediction. See\n",
        "      `ModeKeys`.\n",
        "    params: An HParams instance as returned by get_hyper_parameters().\n",
        "    config: Optional configuration object. Will receive what is passed to\n",
        "      Estimator in `config` parameter, or the default `config`. Allows updating\n",
        "      things in your model_fn based on configuration such as `num_ps_replicas`,\n",
        "      or `model_dir`. Unused currently.\n",
        "\n",
        "  Returns:\n",
        "     A `tf.estimator.EstimatorSpec` for the base feed-forward model. This does\n",
        "     not include graph-based regularization.\n",
        "  \"\"\"\n",
        "\n",
        "  is_training = mode == tf.estimator.ModeKeys.TRAIN\n",
        "\n",
        "  # Build the computation graph.\n",
        "  probabilities, _ = feed_forward_model(features, is_training)\n",
        "  predictions = tf.round(probabilities)\n",
        "\n",
        "  if mode == tf.estimator.ModeKeys.PREDICT:\n",
        "    # labels will be None, and no loss to compute.\n",
        "    cross_entropy_loss = None\n",
        "    eval_metric_ops = None\n",
        "  else:\n",
        "    # Loss is required in train and eval modes.\n",
        "    # Flatten 'probabilities' to 1-D.\n",
        "    probabilities = tf.reshape(probabilities, shape=[-1])\n",
        "    cross_entropy_loss = tf.compat.v1.keras.losses.binary_crossentropy(\n",
        "        labels, probabilities)\n",
        "    eval_metric_ops = {\n",
        "        'accuracy': tf.compat.v1.metrics.accuracy(labels, predictions)\n",
        "    }\n",
        "\n",
        "  if is_training:\n",
        "    global_step = tf.compat.v1.train.get_or_create_global_step()\n",
        "    train_op = build_train_op(cross_entropy_loss, global_step)\n",
        "  else:\n",
        "    train_op = None\n",
        "\n",
        "  return tf.estimator.EstimatorSpec(\n",
        "      mode=mode,\n",
        "      predictions={\n",
        "          'probabilities': probabilities,\n",
        "          'predictions': predictions\n",
        "      },\n",
        "      loss=cross_entropy_loss,\n",
        "      train_op=train_op,\n",
        "      eval_metric_ops=eval_metric_ops)\n",
        "\n",
        "\n",
        "# Tf.Transform considers these features as \"raw\"\n",
        "def _get_raw_feature_spec(schema):\n",
        "  return schema_utils.schema_as_feature_spec(schema).feature_spec\n",
        "\n",
        "\n",
        "def _gzip_reader_fn(filenames):\n",
        "  \"\"\"Small utility returning a record reader that can read gzip'ed files.\"\"\"\n",
        "  return tf.data.TFRecordDataset(\n",
        "      filenames,\n",
        "      compression_type='GZIP')\n",
        "\n",
        "\n",
        "def _example_serving_receiver_fn(tf_transform_output, schema):\n",
        "  \"\"\"Build the serving in inputs.\n",
        "\n",
        "  Args:\n",
        "    tf_transform_output: A TFTransformOutput.\n",
        "    schema: the schema of the input data.\n",
        "\n",
        "  Returns:\n",
        "    Tensorflow graph which parses examples, applying tf-transform to them.\n",
        "  \"\"\"\n",
        "  raw_feature_spec = _get_raw_feature_spec(schema)\n",
        "  raw_feature_spec.pop(LABEL_KEY)\n",
        "\n",
        "  # We don't need the ID feature for serving.\n",
        "  raw_feature_spec.pop(ID_FEATURE_KEY)\n",
        "\n",
        "  raw_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(\n",
        "      raw_feature_spec, default_batch_size=None)\n",
        "  serving_input_receiver = raw_input_fn()\n",
        "\n",
        "  transformed_features = tf_transform_output.transform_raw_features(\n",
        "      serving_input_receiver.features)\n",
        "\n",
        "  # Even though, LABEL_KEY was removed from 'raw_feature_spec', the transform\n",
        "  # operation would have injected the transformed LABEL_KEY feature with a\n",
        "  # default value.\n",
        "  transformed_features.pop(_transformed_name(LABEL_KEY))\n",
        "  return tf.estimator.export.ServingInputReceiver(\n",
        "      transformed_features, serving_input_receiver.receiver_tensors)\n",
        "\n",
        "\n",
        "def _eval_input_receiver_fn(tf_transform_output, schema):\n",
        "  \"\"\"Build everything needed for the tf-model-analysis to run the model.\n",
        "\n",
        "  Args:\n",
        "    tf_transform_output: A TFTransformOutput.\n",
        "    schema: the schema of the input data.\n",
        "\n",
        "  Returns:\n",
        "    EvalInputReceiver function, which contains:\n",
        "      - Tensorflow graph which parses raw untransformed features, applies the\n",
        "        tf-transform preprocessing operators.\n",
        "      - Set of raw, untransformed features.\n",
        "      - Label against which predictions will be compared.\n",
        "  \"\"\"\n",
        "  # Notice that the inputs are raw features, not transformed features here.\n",
        "  raw_feature_spec = _get_raw_feature_spec(schema)\n",
        "\n",
        "  # We don't need the ID feature for TFMA.\n",
        "  raw_feature_spec.pop(ID_FEATURE_KEY)\n",
        "\n",
        "  raw_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(\n",
        "      raw_feature_spec, default_batch_size=None)\n",
        "  serving_input_receiver = raw_input_fn()\n",
        "\n",
        "  transformed_features = tf_transform_output.transform_raw_features(\n",
        "      serving_input_receiver.features)\n",
        "\n",
        "  labels = transformed_features.pop(_transformed_name(LABEL_KEY))\n",
        "  return tfma.export.EvalInputReceiver(\n",
        "      features=transformed_features,\n",
        "      receiver_tensors=serving_input_receiver.receiver_tensors,\n",
        "      labels=labels)\n",
        "\n",
        "\n",
        "def _augment_feature_spec(feature_spec, num_neighbors):\n",
        "  \"\"\"Augments `feature_spec` to include neighbor features.\n",
        "    Args:\n",
        "      feature_spec: Dictionary of feature keys mapping to TF feature types.\n",
        "      num_neighbors: Number of neighbors to use for feature key augmentation.\n",
        "    Returns:\n",
        "      An augmented `feature_spec` that includes neighbor feature keys.\n",
        "  \"\"\"\n",
        "  for i in range(num_neighbors):\n",
        "    feature_spec['{}{}_{}'.format(NBR_FEATURE_PREFIX, i, 'id')] = \\\n",
        "        tf.io.VarLenFeature(dtype=tf.string)\n",
        "    # We don't care about the neighbor features corresponding to\n",
        "    # _transformed_name(LABEL_KEY) because the LABEL_KEY feature will be\n",
        "    # removed from the feature spec during training/evaluation.\n",
        "    feature_spec['{}{}_{}'.format(NBR_FEATURE_PREFIX, i, 'text_xf')] = \\\n",
        "        tf.io.FixedLenFeature(shape=[HPARAMS.max_seq_length], dtype=tf.int64,\n",
        "                              default_value=tf.constant(0, dtype=tf.int64,\n",
        "                                                        shape=[HPARAMS.max_seq_length]))\n",
        "    # The 'NL_num_nbrs' features is currently not used.\n",
        "\n",
        "  # Set the neighbor weight feature keys.\n",
        "  for i in range(num_neighbors):\n",
        "    feature_spec['{}{}{}'.format(NBR_FEATURE_PREFIX, i, NBR_WEIGHT_SUFFIX)] = \\\n",
        "        tf.io.FixedLenFeature(shape=[1], dtype=tf.float32, default_value=[0.0])\n",
        "\n",
        "  return feature_spec\n",
        "\n",
        "\n",
        "def _input_fn(filenames, tf_transform_output, is_training, batch_size=200):\n",
        "  \"\"\"Generates features and labels for training or evaluation.\n",
        "\n",
        "  Args:\n",
        "    filenames: [str] list of CSV files to read data from.\n",
        "    tf_transform_output: A TFTransformOutput.\n",
        "    is_training: Boolean indicating if we are in training mode.\n",
        "    batch_size: int First dimension size of the Tensors returned by input_fn\n",
        "\n",
        "  Returns:\n",
        "    A (features, indices) tuple where features is a dictionary of\n",
        "      Tensors, and indices is a single Tensor of label indices.\n",
        "  \"\"\"\n",
        "  transformed_feature_spec = (\n",
        "      tf_transform_output.transformed_feature_spec().copy())\n",
        "\n",
        "  # During training, NSL uses augmented training data (which includes features\n",
        "  # from graph neighbors). So, update the feature spec accordingly. This needs\n",
        "  # to be done because we are using different schemas for NSL training and eval,\n",
        "  # but the Trainer Component only accepts a single schema.\n",
        "  if is_training:\n",
        "    transformed_feature_spec =_augment_feature_spec(transformed_feature_spec,\n",
        "                                                    HPARAMS.num_neighbors)\n",
        "\n",
        "  dataset = tf.data.experimental.make_batched_features_dataset(\n",
        "      filenames, batch_size, transformed_feature_spec, reader=_gzip_reader_fn)\n",
        "\n",
        "  transformed_features = tf.compat.v1.data.make_one_shot_iterator(\n",
        "      dataset).get_next()\n",
        "  # We pop the label because we do not want to use it as a feature while we're\n",
        "  # training.\n",
        "  return transformed_features, transformed_features.pop(\n",
        "      _transformed_name(LABEL_KEY))\n",
        "\n",
        "\n",
        "# TFX will call this function\n",
        "def trainer_fn(hparams, schema):\n",
        "  \"\"\"Build the estimator using the high level API.\n",
        "  Args:\n",
        "    hparams: Holds hyperparameters used to train the model as name/value pairs.\n",
        "    schema: Holds the schema of the training examples.\n",
        "  Returns:\n",
        "    A dict of the following:\n",
        "      - estimator: The estimator that will be used for training and eval.\n",
        "      - train_spec: Spec for training.\n",
        "      - eval_spec: Spec for eval.\n",
        "      - eval_input_receiver_fn: Input function for eval.\n",
        "  \"\"\"\n",
        "  train_batch_size = 40\n",
        "  eval_batch_size = 40\n",
        "\n",
        "  tf_transform_output = tft.TFTransformOutput(hparams.transform_output)\n",
        "\n",
        "  train_input_fn = lambda: _input_fn(\n",
        "      hparams.train_files,\n",
        "      tf_transform_output,\n",
        "      is_training=True,\n",
        "      batch_size=train_batch_size)\n",
        "\n",
        "  eval_input_fn = lambda: _input_fn(\n",
        "      hparams.eval_files,\n",
        "      tf_transform_output,\n",
        "      is_training=False,\n",
        "      batch_size=eval_batch_size)\n",
        "\n",
        "  train_spec = tf.estimator.TrainSpec(\n",
        "      train_input_fn,\n",
        "      max_steps=hparams.train_steps)\n",
        "\n",
        "  serving_receiver_fn = lambda: _example_serving_receiver_fn(\n",
        "      tf_transform_output, schema)\n",
        "\n",
        "  exporter = tf.estimator.FinalExporter('imdb', serving_receiver_fn)\n",
        "  eval_spec = tf.estimator.EvalSpec(\n",
        "      eval_input_fn,\n",
        "      steps=hparams.eval_steps,\n",
        "      exporters=[exporter],\n",
        "      name='imdb-eval')\n",
        "\n",
        "  run_config = tf.estimator.RunConfig(\n",
        "      save_checkpoints_steps=999, keep_checkpoint_max=1)\n",
        "\n",
        "  run_config = run_config.replace(model_dir=hparams.serving_model_dir)\n",
        "\n",
        "  estimator = tf.estimator.Estimator(\n",
        "      model_fn=feed_forward_model_fn, config=run_config, params=HPARAMS)\n",
        "\n",
        "  # Create a graph regularization config.\n",
        "  graph_reg_config = nsl.configs.make_graph_reg_config(\n",
        "      max_neighbors=HPARAMS.num_neighbors,\n",
        "      multiplier=HPARAMS.graph_regularization_multiplier,\n",
        "      distance_type=HPARAMS.distance_type,\n",
        "      sum_over_axis=-1)\n",
        "\n",
        "  # Invoke the Graph Regularization Estimator wrapper to incorporate\n",
        "  # graph-based regularization for training.\n",
        "  graph_nsl_estimator = nsl.estimator.add_graph_regularization(\n",
        "      estimator,\n",
        "      embedding_fn,\n",
        "      optimizer_fn=optimizer_fn,\n",
        "      graph_reg_config=graph_reg_config)\n",
        "\n",
        "  # Create an input receiver for TFMA processing\n",
        "  receiver_fn = lambda: _eval_input_receiver_fn(\n",
        "      tf_transform_output, schema)\n",
        "\n",
        "  return {\n",
        "      'estimator': graph_nsl_estimator,\n",
        "      'train_spec': train_spec,\n",
        "      'eval_spec': eval_spec,\n",
        "      'eval_input_receiver_fn': receiver_fn\n",
        "  }"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GnLjStUJIoos"
      },
      "source": [
        "Create and run the `Trainer` component, passing it the file that we created above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "MWLQI6t0b2pg"
      },
      "outputs": [],
      "source": [
        "# Uses user-provided Python function that implements a model using TensorFlow's\n",
        "# Estimators API.\n",
        "trainer = Trainer(\n",
        "    module_file=_trainer_module_file,\n",
        "    transformed_examples=graph_augmentation.outputs['augmented_examples'],\n",
        "    schema=schema_gen.outputs['schema'],\n",
        "    transform_graph=transform.outputs['transform_graph'],\n",
        "    train_args=trainer_pb2.TrainArgs(num_steps=10000),\n",
        "    eval_args=trainer_pb2.EvalArgs(num_steps=5000))\n",
        "context.run(trainer)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pDiZvYbFb2ph"
      },
      "source": [
        "Take a peek at the trained model which was exported from `Trainer`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qDBZG9Oso-BD"
      },
      "outputs": [],
      "source": [
        "train_uri = trainer.outputs['model'].get()[0].uri\n",
        "serving_model_path = os.path.join(train_uri, 'serving_model_dir')\n",
        "exported_model = tf.saved_model.load(serving_model_path)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KyT3ZVGCZWsj"
      },
      "outputs": [],
      "source": [
        "exported_model.graph.get_operations()[:10] + [\"...\"]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zIsspBf5GjKm"
      },
      "source": [
        "Let's visualize the model's metrics using Tensorboard."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rnKeqLmcGqHH"
      },
      "outputs": [],
      "source": [
        "#docs_infra: no_execute\n",
        "\n",
        "# Get the URI of the output artifact representing the training logs,\n",
        "# which is a directory\n",
        "model_run_dir = trainer.outputs['model_run'].get()[0].uri\n",
        "\n",
        "%load_ext tensorboard\n",
        "%tensorboard --logdir {model_run_dir}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LgZXZJBsGzHm"
      },
      "source": [
        "## Model Serving\n",
        "\n",
        "Graph regularization only affects the training workflow by adding a regularization term to  the loss function. As a result, the model evaluation and serving workflows remain unchanged. It is for the same reason that we've also omitted downstream TFX components that typically come after the *Trainer* component like the *Evaluator*, *Pusher*, etc."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qOh5FjbWiP-b"
      },
      "source": [
        "## Conclusion\n",
        "\n",
        "We have demonstrated the use of graph regularization using the Neural Structured\n",
        "Learning (NSL) framework in a TFX pipeline even when the input does not contain\n",
        "an explicit graph. We considered the task of sentiment classification of IMDB\n",
        "movie reviews for which we synthesized a similarity graph based on review\n",
        "embeddings. We encourage users to experiment further by using different\n",
        "embeddings for graph construction, varying hyperparameters, changing the amount\n",
        "of supervision, and by defining different model architectures."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [
        "24gYiJcWNlpA"
      ],
      "name": "neural_structured_learning.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
